Introduction

In this tutorial, we will cover some basics about how to apply network analyses to ecological communities. After this practical you should be able to know how to construct a network, represent it, and perform basic analyses. Keep in mind that this tutorial is just an introduction to the topic of network analyses, if you are more interested, there are plenty of online materials but also do not hesitate to contact us for more.

All this tutorial is run in R and RStudio. However, DON’T PANIC! this is not an R course and we won’t evaluate your R skills, you should be able to follow by using this html document.

Basics of network analysis

Network analysis is a set of techniques that can be used to investigate the causes and consequences of complex social and ecological interactions among the species within a community (Farine & Whitehead, 2015). Networks enable the visualization of complex, multidimensional data as well as provide diverse statistical indices for interpreting the resultant graphs (Jones et al., 2018). This set of techniques were first developed in the 1930s by social scientists (Wasserman & Faust, 1994), to investigate the link between local patterns of human relationships and social processes, such as the impact of social groups on the likelihood of being obese (Christakis & Fowler, 2007).

Network analysis is based on graph theory and provides a flexible framework for analysing association or interaction data to address a broad set of biological questions, such as:

  • Describe a social structure.
  • Detect affiliation or avoidance between conspecifics.
  • Interspecific interactions.
  • Mating behaviour.
  • Genetic networks.
  • Community-level linkages.
  • Animal movement.

What is a network?

The term ‘network’ can mean different things to different people. Here, we will define a network as a description for the observed patterns of associations or interactions between species or conspecifics (Farine & Whitehead, 2015).

Network structure

Networks consist of nodes connected by edges:

  • Nodes or vertices can represent individuals, species, groups, classes of individuals or other entities. Each node can possess attributes, such as the identity and phenotypic traits of the individual it represents.
  • Edges or links generally represent how two nodes relate to one another, and can be used to describe how frequently they associate or interact, or to describe other relationships (e.g. genetic relatedness). They often have numeric values (weighted edges), describing the strength of the relationship (e.g. rates or numbers of interactions), although it can sometimes be useful to conceive of edges as binary (either 0 or 1) indicating the presence or absence of a relationship (e.g. whether a male–female pair have copulated or not). It can also be undirected (lack of direction) or directed (we know the direction of the interaction).

Figure 1. Network representation and its components.

Representation of networks

We can represent networks in two ways.

  • Adjacency matrix: an N x N matrix, where N is the number of individuals in the study. Typically, the matrix is read as the ‘actors’ along the rows associating or interacting with the ‘receivers’ along the columns.
    • Rows and columns represent nodes (n x n).

    • Values of the matrix represent the weight of each edge.

    • Diagonals represent self-edges (usually equal to 0 in ecological communities).

    • Symmetrical (undirected networks) or asymmetrical (directed networks).

  • Network diagrams: to visualise social connections and the overall structure of the network.
    • 2D representations of the network.
    • Nodes arranged depending on their connections within the graph.
    • The edges connect the different nodes of the network and determine whether it is directed (arrows) or undirected (single lines).
    • Node and edge attributes or network metrics can be represented.

Figure 2. Different methods for network representation.


Question

Are the network and the adjacency matrix from Figure 2 directed or undirected?


Importing network data into R

The first step of any data analysis is to load the data. Here, we will be using a small network of interactions from the movie Star Wars Episode IV. Each node is a character and each edge indicates whether they appeared together in a scene of the movie or not. Edges here are thus undirected and they also have weights attached, since they can appear in multiple scenes together.

The first step of any r script is to install and/or load the libraries that we are going to use. To install packages we will use the function install.packages which allows us to install any r packages that we ask for. To load these packages we are going to use the function library which allows us to call the r packages that we have installed in our computer.

# install.packages("igraph")
# install.packages("tidyverse")
# install.packages("DT")
library(igraph)
library(tidyverse)
library(DT) # visualize tables in Rmarkdown (this package doesn't need to be loaded by you)

Then we load the data into r. We will do this using the function read.csv. Notice that we load the list of edges and nodes separately.

edges <- read.csv("data/star-wars-network-edges.csv")
nodes <- read.csv("data/star-wars-network-nodes.csv")

Now we can have a look at the data

edges %>% 
  DT::datatable()
nodes %>% 
  DT::datatable()

Questions

How many characters appear in the movie?

How many unique interactions do we have?

How many times C-3PO and R2-D2 appeared in scenes together?


Create a network from data

So, how do we convert these two datasets into a network object in R? There are multiple packages to work with networks, but the most popular is igraph because it’s very flexible and easy to use. Other packages that you may want to explore are sna and networks.

To create a network we need to create an igraph object. To do that, we will use the function graph_from_data_frame, which has two arguments: d, a data frame with the edge list; and vertices, a data frame with node data with the node label in the first column.

g <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE)
g
## IGRAPH a55215c UNW- 22 60 -- 
## + attr: name (v/c), id (v/n), weight (e/n)
## + edges from a55215c (vertex names):
##  [1] R2-D2      --C-3PO       R2-D2      --LUKE        R2-D2      --OBI-WAN    
##  [4] R2-D2      --LEIA        R2-D2      --HAN         R2-D2      --CHEWBACCA  
##  [7] R2-D2      --DODONNA     CHEWBACCA  --OBI-WAN     CHEWBACCA  --C-3PO      
## [10] CHEWBACCA  --LUKE        CHEWBACCA  --HAN         CHEWBACCA  --LEIA       
## [13] CHEWBACCA  --DARTH VADER CHEWBACCA  --DODONNA     LUKE       --CAMIE      
## [16] CAMIE      --BIGGS       LUKE       --BIGGS       DARTH VADER--LEIA       
## [19] LUKE       --BERU        BERU       --OWEN        C-3PO      --BERU       
## [22] LUKE       --OWEN        C-3PO      --LUKE        C-3PO      --OWEN       
## + ... omitted several edges

Let’s interpret the output together

  • U means that the network is undirected.
  • N means named graph.
  • W means weighted graph.
  • 22 is the number of nodes.
  • 60 is the number of edges.
  • name (v/c) means name is a node attribute and it’s a character.
  • weight (e/n) means weight is an edge attribute and it’s numeric.

We also have a bunch of functions to access to specific elements within the igraph object:

V(g) # nodes
## + 22/22 vertices, named, from a55215c:
##  [1] R2-D2       CHEWBACCA   C-3PO       LUKE        DARTH VADER CAMIE      
##  [7] BIGGS       LEIA        BERU        OWEN        OBI-WAN     MOTTI      
## [13] TARKIN      HAN         GREEDO      JABBA       DODONNA     GOLD LEADER
## [19] WEDGE       RED LEADER  RED TEN     GOLD FIVE
V(g)$name # names of each node
##  [1] "R2-D2"       "CHEWBACCA"   "C-3PO"       "LUKE"        "DARTH VADER"
##  [6] "CAMIE"       "BIGGS"       "LEIA"        "BERU"        "OWEN"       
## [11] "OBI-WAN"     "MOTTI"       "TARKIN"      "HAN"         "GREEDO"     
## [16] "JABBA"       "DODONNA"     "GOLD LEADER" "WEDGE"       "RED LEADER" 
## [21] "RED TEN"     "GOLD FIVE"
vertex_attr(g) # all attributes of the nodes
## $name
##  [1] "R2-D2"       "CHEWBACCA"   "C-3PO"       "LUKE"        "DARTH VADER"
##  [6] "CAMIE"       "BIGGS"       "LEIA"        "BERU"        "OWEN"       
## [11] "OBI-WAN"     "MOTTI"       "TARKIN"      "HAN"         "GREEDO"     
## [16] "JABBA"       "DODONNA"     "GOLD LEADER" "WEDGE"       "RED LEADER" 
## [21] "RED TEN"     "GOLD FIVE"  
## 
## $id
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
E(g) # edges
## + 60/60 edges from a55215c (vertex names):
##  [1] R2-D2      --C-3PO       R2-D2      --LUKE        R2-D2      --OBI-WAN    
##  [4] R2-D2      --LEIA        R2-D2      --HAN         R2-D2      --CHEWBACCA  
##  [7] R2-D2      --DODONNA     CHEWBACCA  --OBI-WAN     CHEWBACCA  --C-3PO      
## [10] CHEWBACCA  --LUKE        CHEWBACCA  --HAN         CHEWBACCA  --LEIA       
## [13] CHEWBACCA  --DARTH VADER CHEWBACCA  --DODONNA     LUKE       --CAMIE      
## [16] CAMIE      --BIGGS       LUKE       --BIGGS       DARTH VADER--LEIA       
## [19] LUKE       --BERU        BERU       --OWEN        C-3PO      --BERU       
## [22] LUKE       --OWEN        C-3PO      --LUKE        C-3PO      --OWEN       
## [25] C-3PO      --LEIA        LUKE       --LEIA        LEIA       --BERU       
## [28] LUKE       --OBI-WAN     C-3PO      --OBI-WAN     LEIA       --OBI-WAN    
## + ... omitted several edges
E(g)$weight # weights for each edge
##  [1] 17 13  6  5  5  3  1  7  5 16 19 11  1  1  2  2  4  1  3  3  2  3 18  2  6
## [26] 17  1 19  6  1  2  1  7  9 26  1  1  6  1  1 13  1  1  1  1  1  1  2  1  1
## [51]  3  3  1  1  3  1  2  1  1  1
edge_attr(g) # all attributes of the edges
## $weight
##  [1] 17 13  6  5  5  3  1  7  5 16 19 11  1  1  2  2  4  1  3  3  2  3 18  2  6
## [26] 17  1 19  6  1  2  1  7  9 26  1  1  6  1  1 13  1  1  1  1  1  1  2  1  1
## [51]  3  3  1  1  3  1  2  1  1  1
g[] # adjacency matrix
## 22 x 22 sparse Matrix of class "dgCMatrix"
##                                                               
## R2-D2        .  3 17 13 . . .  5 . .  6 . .  5 . . 1 . . . . .
## CHEWBACCA    3  .  5 16 1 . . 11 . .  7 . . 19 . . 1 . . . . .
## C-3PO       17  5  . 18 . . 1  6 2 2  6 . .  6 . . . . . 1 . .
## LUKE        13 16 18  . . 2 4 17 3 3 19 . . 26 . . 1 1 2 3 1 .
## DARTH VADER  .  1  .  . . . .  1 . .  1 1 7  . . . . . . . . .
## CAMIE        .  .  .  2 . . 2  . . .  . . .  . . . . . . . . .
## BIGGS        .  .  1  4 . 2 .  1 . .  . . .  . . . . 1 2 3 . .
## LEIA         5 11  6 17 1 . 1  . 1 .  1 1 1 13 . . . . . 1 . .
## BERU         .  .  2  3 . . .  1 . 3  . . .  . . . . . . . . .
## OWEN         .  .  2  3 . . .  . 3 .  . . .  . . . . . . . . .
## OBI-WAN      6  7  6 19 1 . .  1 . .  . . .  9 . . . . . . . .
## MOTTI        .  .  .  . 1 . .  1 . .  . . 2  . . . . . . . . .
## TARKIN       .  .  .  . 7 . .  1 . .  . 2 .  . . . . . . . . .
## HAN          5 19  6 26 . . . 13 . .  9 . .  . 1 1 . . . . . .
## GREEDO       .  .  .  . . . .  . . .  . . .  1 . . . . . . . .
## JABBA        .  .  .  . . . .  . . .  . . .  1 . . . . . . . .
## DODONNA      1  1  .  1 . . .  . . .  . . .  . . . . 1 1 . . .
## GOLD LEADER  .  .  .  1 . . 1  . . .  . . .  . . . 1 . 1 1 . .
## WEDGE        .  .  .  2 . . 2  . . .  . . .  . . . 1 1 . 3 . .
## RED LEADER   .  .  1  3 . . 3  1 . .  . . .  . . . . 1 3 . 1 .
## RED TEN      .  .  .  1 . . .  . . .  . . .  . . . . . . 1 . .
## GOLD FIVE    .  .  .  . . . .  . . .  . . .  . . . . . . . . .
g[1,] # first row of adjacency matrix
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##           0           3          17          13           0           0 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##           0           5           0           0           6           0 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##           0           5           0           0           1           0 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##           0           0           0           0

Network visualization

How can we visualize this network? We can use the function plot() from base r, which can produce a decent output:

par(mar=c(0,0,0,0))
plot(g)

As you can see, there is a lot of room for improvement in this figure. Similarly, to any r function, we can check all the available plotting options, using this code ?igraph.plotting. Let’s improve the graph a bit.

par(mar=c(0,0,0,0))
plot(g,
     vertex.color = "grey", # change color of nodes
     vertex.label.color = "black", # change color of labels
     vertex.label.cex = .75, # change size of labels to 75% of original size
     edge.curved=.25, # add a 25% curve to the edges
     edge.color="grey20") # change edge color to grey

In some cases we might want to modify some of the plotting attributes according to some properties of the network. For example, we could change the size of the nodes and node labels so that they match their importance. In our data set, strength corresponds to the number of scenes the characters appear in.

V(g)$size <- strength(g)
par(mar=c(0,0,0,0)) 
plot(g)

The plot is a bit messy, and we can’t actually see all the characters. One thing we can do is to show only those characters that appear in 10 or more scenes.

# taking the log to improve the visualisation
V(g)$size <- log(strength(g)) * 4 + 3
V(g)$label <- ifelse( strength(g)>=10, V(g)$name, NA )
par(mar=c(0,0,0,0)) 
plot(g)

# what does `ifelse` do?
nodes$name=="R2-D2"
##  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
ifelse(nodes$name=="R2-D2", "yes", "no")
##  [1] "yes" "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no" 
## [13] "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"
ifelse(grepl("R", nodes$name), "yes", "no")
##  [1] "yes" "no"  "no"  "no"  "yes" "no"  "no"  "no"  "yes" "no"  "no"  "no" 
## [13] "yes" "no"  "yes" "no"  "no"  "yes" "no"  "yes" "yes" "no"

We can also change the colors of each node based on what side they’re in (dark side or light side).

# create vectors with characters in each side
dark_side <- c("DARTH VADER", "MOTTI", "TARKIN")
light_side <- c("R2-D2", "CHEWBACCA", "C-3PO", "LUKE", "CAMIE", "BIGGS", "LEIA", "BERU", "OWEN", "OBI-WAN", "HAN", "DODONNA", "GOLD LEADER", "WEDGE", "RED LEADER", "RED TEN", "GOLD FIVE")
other <- c("GREEDO", "JABBA")
# node we'll create a new color variable as a node property
V(g)$color <- NA
V(g)$color[V(g)$name %in% dark_side] <- "red"
V(g)$color[V(g)$name %in% light_side] <- "gold"
V(g)$color[V(g)$name %in% other] <- "grey20"
vertex_attr(g)
## $name
##  [1] "R2-D2"       "CHEWBACCA"   "C-3PO"       "LUKE"        "DARTH VADER"
##  [6] "CAMIE"       "BIGGS"       "LEIA"        "BERU"        "OWEN"       
## [11] "OBI-WAN"     "MOTTI"       "TARKIN"      "HAN"         "GREEDO"     
## [16] "JABBA"       "DODONNA"     "GOLD LEADER" "WEDGE"       "RED LEADER" 
## [21] "RED TEN"     "GOLD FIVE"  
## 
## $id
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
## 
## $size
##  [1] 18.648092 19.572539 19.635532 22.439250 12.591581  8.545177 13.556229
##  [8] 19.310150 11.788898 11.317766 18.567281  8.545177 12.210340 20.528107
## [15]  3.000000  3.000000  9.437752  9.437752 11.788898 13.259797  5.772589
## [22]      -Inf
## 
## $label
##  [1] "R2-D2"       "CHEWBACCA"   "C-3PO"       "LUKE"        "DARTH VADER"
##  [6] NA            "BIGGS"       "LEIA"        NA            NA           
## [11] "OBI-WAN"     NA            "TARKIN"      "HAN"         NA           
## [16] NA            NA            NA            NA            "RED LEADER" 
## [21] NA            NA           
## 
## $color
##  [1] "gold"   "gold"   "gold"   "gold"   "red"    "gold"   "gold"   "gold"  
##  [9] "gold"   "gold"   "gold"   "red"    "red"    "gold"   "grey20" "grey20"
## [17] "gold"   "gold"   "gold"   "gold"   "gold"   "gold"
par(mar=c(0,0,0,0)) 
plot(g)

# what does %in% do?
1 %in% c(1,2,3,4)
## [1] TRUE
1 %in% c(2,3,4)
## [1] FALSE

If we want to indicate what the colors correspond to, we can add a legend.

par(mar=c(0,0,0,0)) 
plot(g)
legend(x=.75, y=.75, 
       legend=c("Dark side", "Light side", "Other"), 
       pch=21, pt.bg=c("red", "gold", "grey20"),
       pt.cex=2, bty="n")

Edge properties can also be modified. For example, we can change the width of each edge, to make it a function of the log number of scenes those two characters appear together.

E(g)$width <- log(E(g)$weight) + 1
edge_attr(g)
## $weight
##  [1] 17 13  6  5  5  3  1  7  5 16 19 11  1  1  2  2  4  1  3  3  2  3 18  2  6
## [26] 17  1 19  6  1  2  1  7  9 26  1  1  6  1  1 13  1  1  1  1  1  1  2  1  1
## [51]  3  3  1  1  3  1  2  1  1  1
## 
## $width
##  [1] 3.833213 3.564949 2.791759 2.609438 2.609438 2.098612 1.000000 2.945910
##  [9] 2.609438 3.772589 3.944439 3.397895 1.000000 1.000000 1.693147 1.693147
## [17] 2.386294 1.000000 2.098612 2.098612 1.693147 2.098612 3.890372 1.693147
## [25] 2.791759 3.833213 1.000000 3.944439 2.791759 1.000000 1.693147 1.000000
## [33] 2.945910 3.197225 4.258097 1.000000 1.000000 2.791759 1.000000 1.000000
## [41] 3.564949 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.693147
## [49] 1.000000 1.000000 2.098612 2.098612 1.000000 1.000000 2.098612 1.000000
## [57] 1.693147 1.000000 1.000000 1.000000
par(mar=c(0,0,0,0))
plot(g)

If you noticed, every time we visualise the network with the function plot, the nodes appear in a different location. This is because it’s running a probabilistic function trying to locate them in the optimal way possible.

However, we can also specify the layout for the plot; that is, the (x,y) coordinates where each node will be placed. igraph has a few different layouts built-in, that will use different algorithms to find an optimal distribution of nodes. The following code illustrates some of these:

par(mfrow=c(2, 3), mar=c(0,0,1,0))
plot(g, layout=layout_randomly, main="Random")
plot(g, layout=layout_in_circle, main="Circle")
plot(g, layout=layout_as_star, main="Star")
plot(g, layout=layout_as_tree, main="Tree")
plot(g, layout=layout_on_grid, main="Grid")
plot(g, layout=layout_with_fr, main="Force-directed")

Note that each of these is actually just a matrix of (x,y) locations for each node.

l <- layout_randomly(g)
str(l)
##  num [1:22, 1:2] -0.9 -0.309 -0.967 0.744 -0.164 ...

The most popular layouts are force-directed. These algorithms, such as Fruchterman-Reingold, try to position the nodes so that the edges have similar length and there are as few crossing edges as possible. The idea is to generate “clean” layouts, where nodes that are closer to each other share more connections in common that those that are located further apart. Note that this is a random algorithm, it might change every time we run it or among different computers. If we want to make it consistent, we will use the function set.seed, which allows us to control the random process and make it consistent through time. That is, if we set.seed and we add a number to it set.seed(777) if we run the plot in different computers and at different times the plot will be the same. If we change the number, the plot will change. Let’s see an example.

par(mfrow=c(1,2))
set.seed(777)
fr <- layout_with_fr(g, niter=1000)
par(mar=c(0,0,0,0))
plot(g, layout=fr)
set.seed(666)
fr <- layout_with_fr(g, niter=1000)
par(mar=c(0,0,0,0)) 
plot(g, layout=fr)


Challenge

Using the Game of Thrones data set that you will be given plot the network of the characters from Game of Thrones using the most optimal display.


Descriptive analyses

Ok, so now we have seen the very basics of how to visualise a network. While visualising the graph already gives us some information, we can use the network to do much more. For example, what are the most important species (nodes) in a network? What is the propensity of two species (nodes) that are connected to both be related to a third one? What are the different hidden communities in a network? So now we are going to have a look at some of these properties using the Star Wars network.

Node properties

Studies use network metrics to estimate differences between individuals in their placement within a social network (at the node level). In animal networks, the most common are based on measures of importance or centrality. This means that with the code below we can measure, in different ways, the importance of a given node (species), or, in our case, the characters of the movie.

The most basic measure is degree, that can be calculated with the function degree. The binary degree is the count of the number of edges connected to the node. If a network is directed, degree can be the partitioned into in-degree and out-degree, representing the number of incoming and outgoing edges, respectively. You can find these using mode="in" or mode="out" or mode="total". This measure captures the gregariousness of individuals, in terms of the number of associates or interaction partners. In the Star Wars network, it will be the unique number of characters that each character is interacting with. We use the function sort to order the values, from low to high.

sort(degree(g))
##   GOLD FIVE      GREEDO       JABBA       CAMIE     RED TEN        OWEN 
##           0           1           1           2           2           3 
##       MOTTI      TARKIN        BERU DARTH VADER     DODONNA GOLD LEADER 
##           3           3           4           5           5           5 
##       WEDGE       R2-D2       BIGGS     OBI-WAN  RED LEADER   CHEWBACCA 
##           5           7           7           7           7           8 
##         HAN       C-3PO        LEIA        LUKE 
##           8          10          12          15

Strength is the sum of all edge weights connected to the node (if all edges have a weight of 1, then the strength will equal the binary degree). Strength can also be partitioned into in-strength and out-strength for directed networks. This measure typically represents the expected total interaction or association rate per sample. For example, a node with a strength of 2 would be expected to be observed with approximately two other individuals on average (if using most association indices).

sort(strength(g))
##   GOLD FIVE      GREEDO       JABBA     RED TEN       CAMIE       MOTTI 
##           0           1           1           2           4           4 
##     DODONNA GOLD LEADER        OWEN        BERU       WEDGE      TARKIN 
##           5           5           8           9           9          10 
## DARTH VADER  RED LEADER       BIGGS     OBI-WAN       R2-D2        LEIA 
##          11          13          14          49          50          59 
##   CHEWBACCA       C-3PO         HAN        LUKE 
##          63          64          80         129

Question

Why is the order of the degree and the strength so different?


Closeness measures how many steps are required to access every other node from a given node. It’s a measure of how long information takes to arrive (who hears news first?). So it attempts to capture the notion that a vertex is ‘central’ if it is ‘close’ to many other vertices. Higher values mean less centrality.

sort(closeness(g, normalized=TRUE))
##      GREEDO       JABBA         HAN        OWEN       CAMIE      TARKIN 
##   0.1282051   0.1282051   0.1459854   0.2083333   0.2247191   0.2500000 
##     OBI-WAN       MOTTI       R2-D2        BERU       WEDGE     RED TEN 
##   0.2631579   0.2631579   0.2739726   0.2739726   0.2777778   0.2857143 
##   CHEWBACCA DARTH VADER        LUKE       C-3PO        LEIA     DODONNA 
##   0.2985075   0.2985075   0.3076923   0.3125000   0.3278689   0.3333333 
## GOLD LEADER       BIGGS  RED LEADER 
##   0.3333333   0.3448276   0.3448276

Betweenness a count of the number of shortest paths that flow through the node. This measures how important a node is for connecting disparate parts of a social network. Individuals with a high betweenness centrality are likely to connect largely independent communities. This often highlights individuals that have a greater tendency to change groups than others.

sort(betweenness(g))
##       CAMIE        OWEN     OBI-WAN       MOTTI      TARKIN      GREEDO 
##    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000 
##       JABBA       WEDGE   GOLD FIVE        BERU     RED TEN DARTH VADER 
##    0.000000    0.000000    0.000000    1.666667    2.200000   15.583333 
##   CHEWBACCA        LUKE       R2-D2 GOLD LEADER  RED LEADER       BIGGS 
##   15.916667   18.333333   22.750000   23.800000   31.416667   31.916667 
##       C-3PO         HAN     DODONNA        LEIA 
##   32.783333   37.000000   47.533333   59.950000

Eigenvector centrality the sum of the centralities of an individual’s neighbours. High centrality can be achieved by having either a large degree or being connected to associates with a high degree (or both). This is a useful property as it captures the potential ‘importance’ of individuals in the network, as social hubs, or for the propagation of information or diseases through animal populations.

sort(eigen_centrality(g)$vector)
##    GOLD FIVE        MOTTI       TARKIN       GREEDO        JABBA      RED TEN 
## 1.839045e-20 9.118469e-03 1.133319e-02 1.154515e-02 1.154515e-02 1.512880e-02 
##  GOLD LEADER  DARTH VADER        CAMIE      DODONNA        WEDGE         OWEN 
## 1.717615e-02 2.671707e-02 3.064953e-02 3.143121e-02 3.410098e-02 6.256707e-02 
##   RED LEADER         BERU        BIGGS        R2-D2      OBI-WAN         LEIA 
## 6.476889e-02 7.063977e-02 7.856101e-02 5.030995e-01 5.417666e-01 5.923767e-01 
##        C-3PO    CHEWBACCA          HAN         LUKE 
## 5.957835e-01 6.577603e-01 8.125507e-01 1.000000e+00

Authority score is another measure of centrality initially applied to the Web. A node has high authority when it is linked by many other nodes that are linking many other nodes.

sort(authority_score(g)$vector)
##    GOLD FIVE        MOTTI       TARKIN       GREEDO        JABBA      RED TEN 
## 3.874369e-17 9.118469e-03 1.133319e-02 1.154515e-02 1.154515e-02 1.512880e-02 
##  GOLD LEADER  DARTH VADER        CAMIE      DODONNA        WEDGE         OWEN 
## 1.717615e-02 2.671707e-02 3.064953e-02 3.143121e-02 3.410098e-02 6.256707e-02 
##   RED LEADER         BERU        BIGGS        R2-D2      OBI-WAN         LEIA 
## 6.476889e-02 7.063977e-02 7.856101e-02 5.030995e-01 5.417666e-01 5.923767e-01 
##        C-3PO    CHEWBACCA          HAN         LUKE 
## 5.957835e-01 6.577603e-01 8.125507e-01 1.000000e+00

Page rank approximates probability that any message will arrive to a particular node. This algorithm was developed by Google founders, and originally applied to website links. It returns the importance of each node as a proportion. The value for each node is proportional to the importance of all the nodes connected to it. Individuals with a large page rank are disproportionately important for connecting different components of the network, and this measure is likely to be important when investigating flows through networks.

sort(page_rank(g)$vector)
##   GOLD FIVE       JABBA      GREEDO     RED TEN       CAMIE     DODONNA 
## 0.007092199 0.008310156 0.008310156 0.010573836 0.013792262 0.016185680 
##       MOTTI GOLD LEADER        OWEN        BERU       WEDGE      TARKIN 
## 0.016813964 0.017945853 0.018881975 0.020368818 0.026377242 0.034180007 
## DARTH VADER  RED LEADER       BIGGS     OBI-WAN       R2-D2        LEIA 
## 0.034576040 0.034578060 0.035070288 0.067378471 0.068538690 0.086027500 
##   CHEWBACCA       C-3PO         HAN        LUKE 
## 0.086390090 0.088708430 0.114631333 0.185268949

We can visualise all these different measures of centrality.

# Calculate them
deg <- degree(g)
str <- strength(g)
clo <- closeness(g) 
bet <- betweenness(g)
eig <- eigen_centrality(g)
aut <- authority_score(g)
ran <- page_rank(g)

# Plot the characters with the highest values of each parameter
par(mar = c(3.1, 8.1, 3.1, 1), mfrow = c(3, 3))
barplot(deg[order(deg, decreasing = TRUE)][1:5], las = 2, horiz = TRUE, 
        main = "Degree")
barplot(str[order(str, decreasing = TRUE)][1:5], las = 2, horiz = TRUE,
        main = "Weighted degree")
barplot(clo[order(clo, decreasing = TRUE)][1:5], las = 2, horiz = TRUE,
        main = "Closeness")
barplot(bet[order(bet, decreasing = TRUE)][1:5], las = 2, horiz = TRUE,
        main = "Betweenness")
barplot(eig$vector[order(eig$vector, decreasing = TRUE)][1:5], las = 2, 
        horiz = TRUE, main = "Eigenvector centrality")
barplot(aut$vector[order(aut$vector, decreasing = TRUE)][1:5], las = 2, 
        horiz = TRUE, main = "Authority score")
barplot(ran$vector[order(ran$vector, decreasing = TRUE)][1:5], las = 2, 
        horiz = TRUE, main = "Page Rank")


Question

Do you spot any difference between the eigenvector centrality, authority score and the page rank?

Which is the most central character in the movie?


Finally, not exactly a measure of centrality, but we can learn more about who each node is connected to by using the following functions: neighbors (for direct neighbors) and ego (for neighbors up to n neighbors away)

neighbors(g, v=which(V(g)$name=="DARTH VADER"))
## + 5/22 vertices, named, from e83521e:
## [1] CHEWBACCA LEIA      OBI-WAN   MOTTI     TARKIN
ego(g, order=2, nodes=which(V(g)$name=="DARTH VADER"))
## [[1]]
## + 14/22 vertices, named, from e83521e:
##  [1] DARTH VADER CHEWBACCA   LEIA        OBI-WAN     MOTTI       TARKIN     
##  [7] R2-D2       C-3PO       LUKE        HAN         DODONNA     BIGGS      
## [13] BERU        RED LEADER

Challenge

Who are the most relevant characters in the the tv show Game of Thrones, according to different measures of centrality?


Comparing Centrality Scores

All the above mentioned measures of centrality are useful on their own, but the correlation between them can also give us information about the structure of the network. It is important to be careful when interpreting network metrics, as these depend on both the measure used (e.g. degree vs. betweenness) and the edge definition (e.g. rates vs. number of interactions), and on how the network is structured (e.g. if a network is structured into communities, metrics calculated within a single community may be very different from those calculated for the entire network). Because the metrics can have different interpretations on different social networks, it is important to visualise the structure of the network and the correlation structures between different metrics.

In Figure 3, Farine & Whitehead (2015) present two toy networks. The first (a) contains two clusters joined by a single individual. In this network, node 1 provides a bridge between the two clusters, and it has a high betweenness (b). However, node 1 has the lowest degree. Thus, if we were to investigate dynamics of spread, betweenness might provide a better estimate of relative node importance. The second network (c) contains individuals that are more uniformly connected. Individual 1 in this network has both the highest degree and, by far, the highest betweenness. Thus, when interpreting the relative importance of nodes in a network, the relationships between different network metrics may be informative.

Figure 3. The effect of network structure on network metric correlations (Farine & Whitehead, 2015).

To understand the relationship of these different metrics in our data we can correlate them. To do that, we need to first put all the centrality metrics together in a single data frame using the function cbind() and correlate them using the function cor(). We are going to round the correlation to two decimals using the function round(,2).

centralities <- cbind(deg, str, bet, eig$vector, aut$vector, ran$vector)
colnames(centralities) <- c("degree", "structure", "betweeness", "eigenvector", "authority", "rank")

round(cor(centralities), 2)
##             degree structure betweeness eigenvector authority rank
## degree        1.00      0.88       0.66        0.84      0.84 0.90
## structure     0.88      1.00       0.41        0.98      0.98 0.99
## betweeness    0.66      0.41       1.00        0.43      0.43 0.42
## eigenvector   0.84      0.98       0.43        1.00      1.00 0.96
## authority     0.84      0.98       0.43        1.00      1.00 0.96
## rank          0.90      0.99       0.42        0.96      0.96 1.00

Challenge

Compare the centrality of the Star Wars network with the one from Game of Thrones.


Network properties

Another important thing we can do with a network is to describe its properties as a whole.

We can start with measures of the size of a network. diameter is the length of the longest path (in number of edges) between two nodes. We can use get_diameter to identify this longest path. mean_distance is the average number of edges between any two nodes in the network. We can find each of these paths between pairs of edges with distances.

diameter(g, directed=FALSE, weights=NA)
## [1] 3
get_diameter(g, directed=FALSE, weights=NA)
## + 4/22 vertices, named, from e83521e:
## [1] DARTH VADER CHEWBACCA   C-3PO       OWEN
mean_distance(g, directed=FALSE, weights=NA)
## [1] 1.909524
dist <- distances(g, weights=NA)
dist[1:5, 1:5]
##             R2-D2 CHEWBACCA C-3PO LUKE DARTH VADER
## R2-D2           0         1     1    1           2
## CHEWBACCA       1         0     1    1           1
## C-3PO           1         1     0    1           2
## LUKE            1         1     1    0           2
## DARTH VADER     2         1     2    2           0

edge_density is the number of edges in a network divided by the total possible edges, or the sum of edge weights divided by the number of possible edges. A potentially important measure for normalizing observed degree distributions as larger networks tend towards very low densities.

edge_density(g)
## [1] 0.2597403
# 22*21 possible edges / 2 because it's undirected = 231 possible edges
# but only 60 exist
60/((22*21)/2)
## [1] 0.2597403

reciprocity measures the propensity of each edge to be a mutual edge; that is, the probability that if i is connected to j, j is also connected to i.

reciprocity(g)
## [1] 1
# Why is it 1?

transitivity, also known as clustering coefficient, measures the probability that adjacent nodes of a network are connected. In other words, if i is connected to j, and j is connected to k, what is the probability that i is also connected to k? This is potentially an important measure, particularly when measuring interactions, as it captures the level of clustering in the network.

transitivity(g)
## [1] 0.5375303

Challenge

Where do the characters show a larger proportion of connections, in Star Wars or Game of Thrones?


Network communities

Networks often have different clusters or communities of nodes that are more densely connected to each other than to the rest of the network. Let’s cover some of the different existing methods to identify these communities.

The most straightforward way to partition a network is into connected components. Each component is a group of nodes that are connected to each other, but not to the rest of the nodes. For example, this network has two components.

components(g)
## $membership
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##           1           1           1           1           1           1 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##           1           1           1           1           1           1 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##           1           1           1           1           1           1 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##           1           1           1           2 
## 
## $csize
## [1] 21  1
## 
## $no
## [1] 2
par(mar=c(0,0,0,0)) 
plot(g)

Most networks have a single giant connected component that includes most nodes. Most studies of networks actually focus on the giant component (e.g. the shortest path between nodes in a network with two or more component is Inf!).

giant <- decompose(g)[[1]]
plot(giant)

Even within a giant component, there can be different subsets of the network that are more connected to each other than to the rest of the network. The goal of community detection algorithms is to identify these subsets.

There are a few different algorithms, each following a different logic.

The walktrap algorithm finds communities through a series of short random walks. The idea is that these random walks tend to stay within the same community. The length of these random walks is 4 edges by default, but you may want to experiment with different values. The goal of this algorithm is to identify the partition that maximizes a modularity score. The modularity of a graph with respect to some division (or vertex types) measures how good the division is, or how separated are the different vertex types from each other. Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules.

cluster_walktrap(giant)
## IGRAPH clustering walktrap, groups: 6, mod: 0.16
## + groups:
##   $`1`
##   [1] "CAMIE"       "BIGGS"       "DODONNA"     "GOLD LEADER" "WEDGE"      
##   [6] "RED LEADER"  "RED TEN"    
##   
##   $`2`
##   [1] "DARTH VADER" "MOTTI"       "TARKIN"     
##   
##   $`3`
##   [1] "R2-D2"     "CHEWBACCA" "C-3PO"     "LUKE"      "LEIA"      "OBI-WAN"  
##   [7] "HAN"      
##   + ... omitted several groups/vertices
cluster_walktrap(giant, steps=10)
## IGRAPH clustering walktrap, groups: 3, mod: 0.15
## + groups:
##   $`1`
##   [1] "DARTH VADER" "MOTTI"       "TARKIN"     
##   
##   $`2`
##    [1] "R2-D2"     "CHEWBACCA" "C-3PO"     "LUKE"      "LEIA"      "BERU"     
##    [7] "OWEN"      "OBI-WAN"   "HAN"       "GREEDO"    "JABBA"    
##   
##   $`3`
##   [1] "CAMIE"       "BIGGS"       "DODONNA"     "GOLD LEADER" "WEDGE"      
##   [6] "RED LEADER"  "RED TEN"    
##   + ... omitted several groups/vertices

Other methods are:

  • The fast and greedy method tries to directly optimize this modularity score.
  • The infomap method attempts to map the flow of information in a network, and the different clusters in which information may remain for longer periods. Similar to walktrap, but not necessarily maximizing modularity, but rather the so-called “map equation”.
  • The edge-betweenness method iteratively removes edges with high betweenness, with the idea that they are likely to connect different parts of the network. Here betweenness (gatekeeping potential) applies to edges, but the intuition is the same.
  • The label propagation method labels each node with unique labels, and then updates these labels by choosing the label assigned to the majority of their neighbors, and repeat this iteratively until each node has the most common labels among its neighbors.
cluster_fast_greedy(giant)
## IGRAPH clustering fast greedy, groups: 4, mod: 0.17
## + groups:
##   $`1`
##   [1] "CHEWBACCA" "LUKE"      "LEIA"      "OBI-WAN"   "HAN"       "GREEDO"   
##   [7] "JABBA"    
##   
##   $`2`
##   [1] "R2-D2" "C-3PO" "BERU"  "OWEN" 
##   
##   $`3`
##   [1] "CAMIE"       "BIGGS"       "DODONNA"     "GOLD LEADER" "WEDGE"      
##   [6] "RED LEADER"  "RED TEN"    
##   + ... omitted several groups/vertices
cluster_edge_betweenness(giant)
## IGRAPH clustering edge betweenness, groups: 2, mod: 0.033
## + groups:
##   $`1`
##    [1] "R2-D2"       "CHEWBACCA"   "DARTH VADER" "LEIA"        "OBI-WAN"    
##    [6] "MOTTI"       "TARKIN"      "HAN"         "GREEDO"      "JABBA"      
##   
##   $`2`
##    [1] "C-3PO"       "LUKE"        "CAMIE"       "BIGGS"       "BERU"       
##    [6] "OWEN"        "DODONNA"     "GOLD LEADER" "WEDGE"       "RED LEADER" 
##   [11] "RED TEN"    
## 
cluster_infomap(giant)
## IGRAPH clustering infomap, groups: 2, mod: 0.064
## + groups:
##   $`1`
##    [1] "R2-D2"       "CHEWBACCA"   "C-3PO"       "LUKE"        "CAMIE"      
##    [6] "BIGGS"       "LEIA"        "BERU"        "OWEN"        "OBI-WAN"    
##   [11] "HAN"         "GREEDO"      "JABBA"       "DODONNA"     "GOLD LEADER"
##   [16] "WEDGE"       "RED LEADER"  "RED TEN"    
##   
##   $`2`
##   [1] "DARTH VADER" "MOTTI"       "TARKIN"     
## 
cluster_label_prop(giant)
## IGRAPH clustering label propagation, groups: 2, mod: 0.064
## + groups:
##   $`1`
##    [1] "R2-D2"       "CHEWBACCA"   "C-3PO"       "LUKE"        "CAMIE"      
##    [6] "BIGGS"       "LEIA"        "BERU"        "OWEN"        "OBI-WAN"    
##   [11] "HAN"         "GREEDO"      "JABBA"       "DODONNA"     "GOLD LEADER"
##   [16] "WEDGE"       "RED LEADER"  "RED TEN"    
##   
##   $`2`
##   [1] "DARTH VADER" "MOTTI"       "TARKIN"     
## 

There is no better or worse algorithm, we will chose one method or another according to our preferences. In general, fastgreedy is the fastest algorithm.

igraph also makes it very easy to plot the resulting communities:

comm <- cluster_infomap(giant)
modularity(comm) # modularity score
## [1] 0.06420569
par(mar=c(0,0,0,0)) 
plot(comm, giant)

Alternatively, we can also add the membership to different communities as a color parameter in the igraph object. That is, we use the function membership on our identified communities, and use this to change their colour.

V(giant)$color <- membership(comm)
par(mar=c(0,0,0,0)) 
plot(giant)

The final way in which we can think about network communities is in terms of hierarchy or structure.

K-core decomposition allows us to identify the core and the periphery of the network. A k-core is a maximal subnet of a network such that all nodes have at least degree K. To calculate this we will use the function coreness.

coreness(g)
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##           6           6           6           6           3           2 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##           5           6           3           3           6           3 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##           3           6           1           1           5           5 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##           5           5           2           0
which(coreness(g)==6) # what is the core of the network?
##     R2-D2 CHEWBACCA     C-3PO      LUKE      LEIA   OBI-WAN       HAN 
##         1         2         3         4         8        11        14
which(coreness(g)==1) # what is the periphery of the network?
## GREEDO  JABBA 
##     15     16
# Visualizing network structure
V(g)$coreness <- coreness(g)
par(mfrow=c(2, 3), mar=c(0.1,0.1,1,0.1))
set.seed(777) 
fr <- layout_with_fr(g)
for (k in 1:6){
  V(g)$color <- ifelse(V(g)$coreness>=k, "orange", "grey")
  plot(g, main=paste0(k, '-core shell'), layout=fr)
}


Challenge

What communities can you find in the Game of Thrones network? Try different community detection algorithms to see if you get different answers.


Bonus: Analysing networks with factors

Up to this point, we have focused on describing networks, both visually and numerically. In this bonus section we aim to explain how networks emerge: what are the mechanisms that explain the structure of the observed networks?

One of the most basic notions governing network formation is homophily, that is, the propensity of individuals to cluster along common traits, such as age, gender, class, etc. Positive assortment suggests that nodes are more connected than expected, whereas negative assortment suggests avoidance of alike nodes. This can be measured on weighted networks and is a powerful approach for identifying phenotypic structure in social networks. For example, positive assortment by degree (gregariousness) has been linked with rapid spread of information or disease through social networks.

In our data set, we are going to explore the association among the different sides of the force, but first we need to assign the side of each character.

# Assign the attribute side to each of the characters 

g <- set_vertex_attr(g, "side", 
                value = c(rep("light", 4), "dark", rep("light",6), rep("dark",2), "light", rep("other",2), rep("light",6)))

# We can explore that is correct

cbind(V(g)$side, V(g)) %>% 
  DT::datatable()

Now that we have “side of the force” introduced as an attribute we can explore the influence of each side on the network. For example, we can calculate the homophily for the whole network and for each side of the force. We can measure the extent to which a network is homophilic along a specific variable by computing the assortativity coeficient. The assortativity coefficient measures the level of homophyly of the graph, based on some vertex labeling or values assigned to vertices. If the coefficient is high, that means that connected vertices tend to have the same labels or similar assigned values.

# Whole network
assortativity_degree(g, directed=FALSE)
## [1] -0.1934052
# By side
assortativity_nominal(g, factor(V(g)$side))
## [1] 0.4055202

The main limitation with this approach is that we don’t know to what extent this coefficients are different from what you would find simply by chance in any network. Furthermore, it is hard to disentangle what is the variable that is driving homophily. For example, the proportion of light side is higher than any of the other sides, and thus the homophily result for side could be simply due to a bias in the data.

We can perform a randomization procedure to determine how likely the observed assortativity in our network is given the side of the force of the characters. We can randomly permute the side of the force of the characters in the network 1000 times and recalculate the assortativity for each random network.

# Convert the side of the force attribute into a numeric value

values <- as.numeric(factor(V(g)$side))

# Calculate the observed assortativity

observed.assortativity <- assortativity(g, values)

# Calculate the assortativity of the network randomizing the side attribute 1000 times
results <- vector('list', 1000)
for(i in 1:1000){
  results[[i]] <- assortativity(g, sample(values))
}

# Plot the distribution of assortativity values and add a red vertical line at the original observed value
hist(unlist(results))
abline(v = observed.assortativity, col = "red", lty = 3, lwd=2)

Given that our value is well outside the permutation area, we can see that our assortativity is not due to chance.

Session info

For ease of reproducibility

sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.27         forcats_0.5.2   stringr_1.5.0   dplyr_1.0.10   
##  [5] purrr_1.0.1     readr_2.1.3     tidyr_1.3.0     tibble_3.1.8   
##  [9] ggplot2_3.4.0   tidyverse_1.3.2 igraph_1.3.5   
## 
## loaded via a namespace (and not attached):
##  [1] lattice_0.20-45     lubridate_1.9.1     assertthat_0.2.1   
##  [4] digest_0.6.31       utf8_1.2.2          R6_2.5.1           
##  [7] cellranger_1.1.0    backports_1.4.1     reprex_2.0.2       
## [10] evaluate_0.20       highr_0.10          httr_1.4.4         
## [13] pillar_1.8.1        rlang_1.0.6         googlesheets4_1.0.1
## [16] readxl_1.4.1        rstudioapi_0.14     jquerylib_0.1.4    
## [19] Matrix_1.5-1        rmarkdown_2.20      googledrive_2.0.0  
## [22] htmlwidgets_1.6.1   munsell_0.5.0       broom_1.0.2        
## [25] compiler_4.2.2      modelr_0.1.10       xfun_0.36          
## [28] pkgconfig_2.0.3     htmltools_0.5.4     tidyselect_1.2.0   
## [31] fansi_1.0.4         crayon_1.5.2        tzdb_0.3.0         
## [34] dbplyr_2.3.0        withr_2.5.0         grid_4.2.2         
## [37] jsonlite_1.8.4      gtable_0.3.1        lifecycle_1.0.3    
## [40] DBI_1.1.3           magrittr_2.0.3      scales_1.2.1       
## [43] cli_3.6.0           stringi_1.7.12      cachem_1.0.6       
## [46] fs_1.6.0            xml2_1.3.3          bslib_0.4.2        
## [49] ellipsis_0.3.2      generics_0.1.3      vctrs_0.5.2        
## [52] tools_4.2.2         glue_1.6.2          crosstalk_1.2.0    
## [55] hms_1.1.2           fastmap_1.1.0       yaml_2.3.7         
## [58] timechange_0.2.0    colorspace_2.1-0    gargle_1.2.1       
## [61] rvest_1.0.3         knitr_1.41          haven_2.5.1        
## [64] sass_0.4.5

  1. Universitat de Barcelona, ↩︎